1 About the project

Introduction to Open Data Science is a course combining open data, open science and data science. The course aims to provide tools and techniques for visualizing, analyzing and interpreting data using the methods of data science. Open research tools such as R, R Markdown and RStudio are used to produce dynamic and reproducible reports combining text and R code, which are published on GitHub pages. Reproducible research makes scientific research transparent by providing access to both data and code, which allows anyone to evaluate how data analysis has been done.

These pages are a learning diary to document progress on the course.

GitHub pages: https://roninkoi.github.io/IODS-project
GitHub repository: https://github.com/Roninkoi/IODS-project

1.1 Initial thoughts

I found out about this course from the doctoral school mailing list, and thought it would help me improve my data analysis skills and teach me how to make scientific research more open. I’m interested in learning R, and this course is a good chance to do so. I have been using Python with Matplotlib/NumPy/SciPy to visualize and analyze data previously, so I’m curious to see the benefits of using R.

1.2 Learning R (week 1)

After reading chapters 1-4 of R for Health Data Science (Harrison and Pius 2021), I have found it to be a useful introduction to R so far. It’s very good that this book is openly available and itself written using R. It focuses quite heavily on practical examples, which has been good for learning. I had the most difficulty with chapter 2 on R basics, since the way some things are done in R seemed a bit unusual and opaque. I was most interested in finding out how R differs from other languages, since I’m familiar with most basic concepts. The most interesting chapter so far has been chapter 4 on plotting; I’ve always found making plots fun.

The RStudio IDE makes working with R and R Markdown easy, but I personally find it a bit clumsy compared to a simple text editor. The main benefit of RStudio has been easy editing of R Markdown, with the ability to evaluate single R blocks and visualization of the results (tables, plots).


2 Regression and model validation (week 2)

Academic performance of students depends on many factors, which may be difficult to measure and model accurately. The grade that a student gets may be attributable to qualities of each student, e.g. motivation and previous knowledge, but the exact relationship between these is not directly measurable. Linear regression is one of the simplest ways to quantify the relationship \(f\) between predictor variables \(X\) and a response variable \(Y\) using a linear model \(\hat Y = \hat f(X)\) (James et al. 2021). By fitting a model to a data set, we can determine the prediction \(\hat Y\) by finding an estimate for \(\hat f\). If we have the appropriate data set available, it should be possible to roughly predict how well a student performs using such a model.
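As a minimal sketch of this idea (using synthetic data, not the course data set), fitting \(\hat f\) with lm() recovers the coefficients of a known linear relationship:

```r
# synthetic example: generate data from a known linear relationship
# and recover the coefficients with a linear model
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)  # true beta_0 = 2, beta_1 = 3

fit <- lm(y ~ x)
coef(fit)  # estimates close to the true values 2 and 3
```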

A data set on student learning was created for regression analysis. This data set is based on student learning data collected in 2014-2015, which describes student learning attitudes and outcomes. The data includes scores from surveys answered by the students, which are used to construct variables that describe different aspects of learning.

An R script was created in data/create_learning2014.R to generate the data set. The raw data contains 183 rows and 60 variables (described in more detail in the code). We include student gender, age and exam points from the raw data in our data set. Additionally, we create variables based on the description in the metadata. The data contains a number of questions \(Q\) on different aspects of learning, which are classified into groups: deep learning \(Q_\text{D}\), surface learning \(Q_\text{SU}\) and strategic learning \(Q_\text{ST}\). Scores for each group are calculated by simply taking the mean over each set of questions: \(\text{D} = \overline Q_\text{D}, \ \text{SU} = \overline Q_\text{SU}, \ \text{ST} = \overline Q_\text{ST}.\) This is done similarly for questions on attitude towards statistics, which we also include in our data set. Our data set now consists of the 7 variables gender, age, attitude, deep, stra, surf and points. Finally, we filter the data such that it only contains students whose exam points are larger than 0, which leaves us with 166 students. The finalized data set is written to data/learning2014.csv.
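The group scores can be computed by taking row means over each question set; a small sketch with invented question columns (the actual column names are documented in data/create_learning2014.R):

```r
# hypothetical question columns for the deep learning group
# (names and values invented for illustration)
q_deep <- data.frame(d1 = c(4, 2), d2 = c(5, 3), d3 = c(3, 4))

# group score = mean over the question set for each student
deep <- rowMeans(q_deep)
deep  # 4 for student 1, 3 for student 2
```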

After loading the data set, we plot the scatter matrix (figure 2.1) to get an overview of how the variables are correlated with one another. This figure contains histograms for each variable, as well as the correlations between variables (scatter plots in the lower half and coefficients in the upper half). Summary statistics for each variable are shown in table 2.1. There are 166 students in the data set (66% women), aged between 17 and 55 (median age 22). Almost all of the students are likely undergraduates, with a few older students, as shown by the sharply peaked age histogram in figure 2.1. Histograms for the learning variables are roughly normal, except for attitude and points. These distributions are not quite unimodal, with attitude having a peak at the lower end (for men) and points having peaks at both the lower and higher ends. This suggests that there are many students who score low or high, with fewer students in between than might be expected.

In figure 2.1, we can look at the correlations between variables to determine which are most significant for the exam scores. The most strongly correlated variable is attitude, which has a positive correlation coefficient of 0.437 (marked by three stars, indicating a high confidence level). This means that for high attitude scores, we are likely to see high exam scores as well. We can even see that the shape of the points histogram slightly resembles the shape of the attitude one. The second and third most significant correlations are with stra (positive) and surf (negative). This suggests that a high strategic learning score is likely to lead to high exam scores, while a high surface learning score is not. These two scores are also negatively correlated with each other, meaning that they represent somewhat opposite learning attitudes. Similarly, we see that the deep learning score is strongly negatively correlated with surf.

# packages used in this report (assumed loaded once, e.g. in a setup chunk)
library(tidyverse)  # read_csv(), filter(), mutate(), ggplot2
library(GGally)     # ggpairs()

# load data set
learning14 <- read_csv("./data/learning2014.csv")

# plot scatter matrix of variables
sm <- ggpairs(learning14, mapping = aes(colour=gender, alpha=0.7), 
              lower = list(combo = wrap("facethist", bins = 20)))
sm

Figure 2.1: Scatter matrix of the learning2014 data set.

# get summaries for different variables
age_summary <- summary(learning14$age)
attitude_summary <- summary(learning14$attitude)
deep_summary <- summary(learning14$deep)
surf_summary <- summary(learning14$surf)
stra_summary <- summary(learning14$stra)
points_summary <- summary(learning14$points)

# calculate number of students
n_students <- nrow(learning14)
n_men <- nrow(filter(learning14, gender == "M"))
n_women <- nrow(filter(learning14, gender == "F"))

kable(data.frame("Age"=c(age_summary[1], age_summary[3], 
                         age_summary[4], age_summary[6]),
                 "Attitude"=c(attitude_summary[1], attitude_summary[3], 
                              attitude_summary[4], attitude_summary[6]),
                 "Deep"=c(deep_summary[1], deep_summary[3], 
                          deep_summary[4], deep_summary[6]), 
                 "Surface"=c(surf_summary[1], surf_summary[3], 
                             surf_summary[4], surf_summary[6]), 
                 "Strategic"=c(stra_summary[1], stra_summary[3], 
                               stra_summary[4], stra_summary[6]), 
                 "Points"=c(points_summary[1], points_summary[3], 
                            points_summary[4], points_summary[6])), 
                 caption = sprintf("Summary of different variables for %i students (%i men, %i women).", 
                                   n_students, n_men, n_women))
Table 2.1: Summary of different variables for 166 students (56 men, 110 women).
Age Attitude Deep Surface Strategic Points
Min. 17.00000 1.400000 1.583333 1.583333 1.250000 7.00000
Median 22.00000 3.200000 3.666667 2.833333 3.187500 23.00000
Mean 25.51205 3.142771 3.679719 2.787149 3.121235 22.71687
Max. 55.00000 5.000000 4.916667 4.333333 5.000000 33.00000

We want to fit a linear model \(\hat f(x_1, x_2, x_3) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\) to the learning2014 data set with 3 variables. To choose these variables, we start by fitting a model with all of them. Fitting the linear model is done using the lm() function. A summary of the fit is generated to get the coefficient values, p-values and \(R^2\). The variable with the largest p-value is removed from the model, and the model is fitted again. We continue this process of elimination until only 3 variables are left: age, attitude and stra. The model with all variables (“All”) and the model with 3 variables (“Linear”) are shown in table 2.2 with their coefficients, p-values and \(R^2\). We can see that the 3-variable model has lower p-values for age and attitude.

The p-values reported for the coefficients are calculated using the \(t\)-test. We want to minimize the p-values for the coefficients of our fit in order to establish statistical significance. For our model’s coefficients to be statistically significant, we would typically like the \(p\)-value to be less than \(5\%\). Similarly, we want to maximize \(R^2\), which is a measure of how good our model is at predicting the response based on the predictor variables (goodness of fit). It is defined using the errors of the fit as \(R^2 = 1 - \text{RSS} / \text{TSS}\), where \(\text{RSS}\) is the residual sum of squares and \(\text{TSS}\) is the total sum of squares (James et al. 2021). By decreasing \(\text{RSS}\), we improve the fit and move \(R^2\) closer to 1.
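The definition of \(R^2\) can be checked directly against lm(); a quick sketch on synthetic data:

```r
# synthetic data for checking the R^2 formula
set.seed(42)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# R^2 = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares
r2 <- 1 - rss / tss

all.equal(r2, summary(fit)$r.squared)  # TRUE
```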

The linear model predicts the exam points using the equation \(\hat f = 10.90 - 0.09 \ \text{age} + 3.48 \ \text{attitude} + 1.00 \ \text{stra}\) (table 2.2). The variable attitude is the most significant, followed by stra, accounting for the majority of the points. This suggests that positive attitudes towards the subject and strategic skills (organization, time management, …) play a significant factor in positive learning outcomes, which is an expected association. Age correlates slightly negatively with points, suggesting that younger people are more likely to earn higher scores. It could be that older people have less time to devote to studying, but other factors may also contribute to this.
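For example, plugging a hypothetical student (age 25, attitude score 4, strategic score 3; values invented for illustration) into the fitted equation:

```r
# predicted exam points for a hypothetical student,
# using the coefficients of the 3-variable linear model
age <- 25; attitude <- 4; stra <- 3
points_hat <- 10.90 - 0.09 * age + 3.48 * attitude + 1.00 * stra
points_hat  # 25.57 points
```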

Only the attitude parameter in the linear model has a p-value of less than 5%, while strategy is fairly close at 6% and age at 10%. The \(R^2\) score for the linear model is 0.218, so the model explains about 22% of the variation in the scores based on the chosen variables. The rest cannot be explained by the model and is due to external factors not accounted for.

If we don’t constrain ourselves to a linear model, we can improve the fit while still only using 3 variables. We can extend the model to account for non-linear terms for the variables: \(\hat f(x_1, x_2, x_3) = \beta_0 + \beta_1 x_1^i + \beta_2 x_2^j + \beta_3 x_3^k\). We can try to find a better fit by varying the powers \(i, j, k\) and picking the model with the highest \(R^2\) value that still reaches the desired p-value. By searching \(i, j, k \in [1, 10]\), we found the best fit with \(\text{age}^7+\text{attitude}+\text{stra}^5\). This model has a 5% significance level for all variables, as well as an \(R^2\) coefficient of 0.26, as shown in table 2.2. However, this more complicated model is difficult to justify for such a limited data set, as the linear model is simpler and performs adequately. Since we don’t have a separate test data set, overfitting is a concern.

Comparisons for exam point histograms are shown in figure 2.2. These figures show the real exam point data with the predicted distribution from the linear model and best fit model. As can be seen, the point distribution is reproduced quite well, with the mean of the distribution roughly correct. Behavior at the tails shows the positions of the peaks (low and high), but their magnitude is overestimated.

# plot histogram of points for data and fit
fit_hist <- function (data, fit, title) {
  # label the first nrow(data) values as data, the rest as fit
  # (a plain c("Data", "Fit") would recycle and alternate the labels)
  points <- data.frame(Points = c(data$points, predict(fit)),
                       Distribution = rep(c("Data", "Fit"), each = nrow(data)))

  ggplot(points, aes(Points, fill = Distribution)) + 
    geom_density(alpha = 0.5) + ylab("Density") + ggtitle(title)
}

# first fit full 7 variable linear model and remove variables based on p-value
fit_full <- lm(points ~ ., data = learning14)
fit_full_summary <- summary(fit_full)

# linear model with 3 most significant variables: age, attitude, stra
fit_lin <- lm(points ~ age + attitude + stra, data = learning14)
fit_lin_summary <- summary(fit_lin)
hist1 <- fit_hist(learning14, fit_lin, title = "age + attitude + stra")

# try to find better model by fitting x^p with varying p
max_r2 <- 0
max_ijk <- c(0, 0, 0)
fit_best <- fit_lin
pn <- 10
for (i in 1:pn) {
  for (j in 1:pn) {
    for (k in 1:pn) {
      # fit model with 3 parameters and varying powers
      fit_ijk <- lm(points ~ I(age^i) + I(attitude^j) + I(stra^k), 
                    data = learning14)
      fit_summary <- summary(fit_ijk)
      
      # is p-value low enough for all parameters?
      good_p <- all(fit_summary$coefficients[, 4] <= 0.05)
      
      # if r2 better than best found so far, accept this fit
      if (fit_summary$r.squared > max_r2 && good_p) {
        max_r2 <- fit_summary$r.squared
        max_ijk <- c(i, j, k)
        fit_best <- fit_ijk
      }
    }
  }
}

fit_best_summary <- summary(fit_best)
message("best fit: (", max_ijk[1], ", ", max_ijk[2], ", ", max_ijk[3], 
        "), R2: ", max_r2)

hist2 <- fit_hist(learning14, fit_best, 
                  sprintf("age^%i+attitude^%i+stra^%i", 
                          max_ijk[1], max_ijk[2], max_ijk[3]))

grid.arrange(hist1, hist2, ncol=2, respect=T)

Figure 2.2: Exam point distributions for the data and the fitted models.

fe <- data.frame("Fit"=c("Intercept", "Intercept p-value", 
                         "Age", "Age p-value", 
                         "Attitude", "Attitude p-value", 
                         "Strategy", "Strategy p-value", "R^2"),
                 "All"=c(fit_full_summary$coefficients[1,1], 
                         fit_full_summary$coefficients[1,4], 
                         fit_full_summary$coefficients[2,1], 
                         fit_full_summary$coefficients[2,4], 
                         fit_full_summary$coefficients[3,1], 
                         fit_full_summary$coefficients[3,4], 
                         fit_full_summary$coefficients[4,1], 
                         fit_full_summary$coefficients[4,4], 
                         fit_full_summary$r.squared),
                 "Linear"=c(fit_lin_summary$coefficients[1,1], 
                            fit_lin_summary$coefficients[1,4], 
                            fit_lin_summary$coefficients[2,1], 
                            fit_lin_summary$coefficients[2,4], 
                            fit_lin_summary$coefficients[3,1], 
                            fit_lin_summary$coefficients[3,4], 
                            fit_lin_summary$coefficients[4,1], 
                            fit_lin_summary$coefficients[4,4], 
                            fit_lin_summary$r.squared),
                 "Best"=c(fit_best_summary$coefficients[1,1], 
                          fit_best_summary$coefficients[1,4], 
                          fit_best_summary$coefficients[2,1], 
                          fit_best_summary$coefficients[2,4], 
                          fit_best_summary$coefficients[3,1], 
                          fit_best_summary$coefficients[3,4], 
                          fit_best_summary$coefficients[4,1], 
                          fit_best_summary$coefficients[4,4], 
                          fit_best_summary$r.squared))
fe <- format_engr(fe)
kable(fe, caption = "Summary of different fits.")
Table 2.2: Summary of different fits.
Fit All Linear Best
Intercept \(18.30\) \(10.90\) \(11.34\)
Intercept p-value \({644.1}\times 10^{-6}\) \({61.72}\times 10^{-6}\) \({1.489}\times 10^{-9}\)
Age \({-58.58}\times 10^{-3}\) \({-88.22}\times 10^{-3}\) \({-6.264}\times 10^{-12}\)
Age p-value \({949.8}\times 10^{-3}\) \({98.05}\times 10^{-3}\) \({1.687}\times 10^{-3}\)
Attitude \({-96.07}\times 10^{-3}\) \(3.481\) \(3.462\)
Attitude p-value \({76.93}\times 10^{-3}\) \({4.719}\times 10^{-9}\) \({2.928}\times 10^{-9}\)
Strategy \(3.443\) \(1.004\) \({1.766}\times 10^{-3}\)
Strategy p-value \({42.35}\times 10^{-9}\) \({62.12}\times 10^{-3}\) \({21.25}\times 10^{-3}\)
R^2 \({231.2}\times 10^{-3}\) \({218.2}\times 10^{-3}\) \({259.6}\times 10^{-3}\)

Diagnostic plots were produced for the linear model, shown in figure 2.3. The model assumes that variance is constant for the data set. Based on the residuals vs. fitted values plot, this is not exactly correct. There is a slight change in the spread of the residuals with respect to the fitted value, which suggests that the assumption of constant variance does not hold (Vehkalahti and Everitt 2019). The normal Q-Q plot of the residuals is not a straight line, but curved at the ends with a step on the left, which suggests that the residuals are not exactly normally distributed, but skewed, heavy-tailed at one end and bimodal. The residuals vs. leverage plot shows the influence of individual data points on the fit coefficients. Most data points should have low leverage, but there are a few outliers in the data set with high leverage and large residuals. Such outliers could potentially be errors in the data set.

par(mfrow=c(1, 3))
plot(fit_lin, which=c(1, 2, 5))

Figure 2.3: Diagnostic plots for linear model.


3 Logistic regression (week 3)

Logistic regression allows us to predict discrete outcomes based on explanatory variables, as opposed to linear regression where the output of the model is continuous (James et al. 2021). An example of this would be whether a student passes a class (true/false), while in linear regression we could predict the grade of the student. However, simple linear regression is not well suited for classification tasks. Logistic regression instead models the conditional probability \(\hat p(X)\) of an outcome given the explanatory variables \(X\) through the log-odds, \(\log\left(\frac{\hat p(X)}{1 - \hat p(X)}\right) = \hat f(X)\), which is then used to determine whether that outcome is likely to occur or not.
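Given a value of the linear predictor, the probability is recovered by inverting the log-odds; a minimal sketch:

```r
# hypothetical log-odds value from the linear predictor
f <- 0.8

# invert the log-odds to get the probability
p <- exp(f) / (1 + exp(f))
p  # about 0.69

# plogis() is R's built-in logistic function and gives the same result
all.equal(p, plogis(f))  # TRUE
```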

We are using student performance data from questionnaires and school reports to model student alcohol consumption based on the background and academic performance of the students. The data contains variables such as age, sex, academic performance (grades, attendance, failures), family background (job, education, support) and personal data (health, alcohol consumption, how much the student goes out with friends), among others. All of the variables in the data set are printed in the code snippet below. This data gives an overview of the life of each student, which can be used to predict habits such as alcohol consumption. The students will be classified into two groups: low alcohol use and high alcohol use. High alcohol use is defined as the mean of weekday and weekend alcohol use exceeding 2. To generate the data set, we use the R script data/create_alc.R. This script joins two data sets, collected from mathematics and Portuguese language students, to create a single data set consisting of students attending both classes. The data sets are joined by identifying each student using the background questions, and calculating the mean of the grades (and other class-specific variables) of the two classes for each student. The resulting data set better captures the academic performance of each student across subjects. The final data set consists of 35 variables and 370 students.

# load data set
alc <- read_csv("./data/alc.csv", show_col_types = F)

# variables in the data set
print(colnames(alc))
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
n_men <- nrow(filter(alc, sex == "M"))
n_women <- nrow(filter(alc, sex == "F"))
message("Number of students: ", n_men+n_women, " (", n_men, " men, ", n_women, " women)")
## Number of students: 370 (175 men, 195 women)

We will pick sex (M/F), goout (how much student goes out with friends), absences (number of school absences) and G3 (final exam score) as the explanatory variables for our logistic regression model. We hypothesize that G3 of each student is negatively correlated with alcohol consumption, and goout and absences are positively correlated. The students who drink a lot would be going out with their friends a lot, while attending school less than their peers who drink little. Social drinking when going out with friends would increase the alcohol consumption for these students. This would be associated with a lower final score, as the students are perhaps less interested in studying (and more interested in partying). We additionally hypothesize that there are more male heavy drinkers and fewer female heavy drinkers. This gives us 4 explanatory variables (sex, goout, absences and G3) to predict whether a student is a heavy drinker high_use (true/false).

model.matrix(~0+., data=alc) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag=FALSE, type="lower", lab=TRUE, lab_size=2)

Figure 3.1: Correlation matrix for student alcohol consumption data set.

We examine the relationships between the variables in the data, as well as our chosen model, by plotting the correlation matrix (figure 3.1). In this matrix, the correlation coefficients between variables in the data set are visualized (red = positive, blue = negative). The correlation coefficients for the variables in our chosen model are: \[ \rho_{\text{high_use},\text{sex (M)}} = 0.21, \ \rho_{\text{high_use},\text{goout}} = 0.36, \\ \rho_{\text{high_use},\text{absences}} = 0.22, \ \rho_{\text{high_use},\text{G3}} = -0.13. \] This suggests that the most significant variable in our model predicting high alcohol use is how often students go out with their friends, followed by the number of absences. It appears that social drinking with friends contributes significantly to the alcohol intake of the students, possibly resulting in absences. In the group of heavy drinkers, a significant majority are men. These observations are in agreement with our previous hypothesis. The final exam score is negatively correlated with high alcohol use, but the correlation is perhaps less significant than initially thought. The number of failures, amount of study time and free time have higher correlation coefficients, so the final exam score may be a worse predictor of alcohol use. It should be noted that high alcohol use is most highly correlated with mean alcohol use and weekday/weekend alcohol use, as it is defined using those variables, so these variables are not included in any model.

h1 <- ggplot(alc, aes(x = goout, fill = high_use)) + 
   geom_histogram(alpha = 0.7) + ylab("Count") + 
  ggtitle("Go out by alcohol consumption") + 
  scale_fill_brewer(palette="Set2")
h2 <- ggplot(alc, aes(x = absences, fill = high_use)) + 
  geom_density(alpha = 0.7) + ylab("Density") + 
  ggtitle("Absences by alcohol consumption") + 
  scale_fill_brewer(palette="Set2")
h3 <- ggplot(alc, aes(x = G3, fill = high_use)) + 
  geom_density(alpha = 0.7) + ylab("Density") + 
  ggtitle("Final exam score by alcohol consumption") + 
  scale_fill_brewer(palette="Set2")
  
b1 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + 
  ylab("Go out") + ggtitle("Go out by alcohol consumption and sex")
b2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + 
  ylab("Absences") + ggtitle("Absences by alcohol consumption and sex")
b3 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + 
  ylab("G3") + ggtitle("Final exam score by alcohol consumption and sex")

grid.arrange(h1, h2, h3, b1, b2, b3, ncol=3, nrow=2)

Figure 3.2: Distributions and box plots for different variables.

The distributions and box plots for our chosen variables are shown in figure 3.2. About 53% of the students are female, so there are roughly equal numbers of male and female students. Looking at the distributions, we can see that students who are heavy drinkers tend to go out more, have more absences and have lower exam scores. In the case of goout and absences, the variance of the distributions appears larger. The median final exam scores for light drinkers and heavy drinkers look about the same, though the distributions look quite different. High-scoring students tend to be mostly light drinkers. The exam score distribution for heavy drinkers has many peaks, but this may be because there are fewer heavy drinkers. The box plots show these variables separated by sex. Overall, high alcohol consumption appears to affect men more than women, as there are greater differences between men who are light drinkers and men who are heavy drinkers. Notably, for women the median number of absences and the median test score appear to be the same regardless of alcohol use, while for men alcohol consumption has a more negative impact. Additionally, alcohol use appears to change the variability of results for the different groups.

We fit a logistic regression model with our 4 chosen variables using the function glm() with the argument family = "binomial". Since we have a factor variable sex, we subtract 1 from the formula to get separate coefficients for each sex. For the coefficients, we calculate the odds ratios (OR) and confidence intervals (CI) by exponentiating the output of the functions coef() and confint(), respectively. The odds ratio tells us how much the odds of the outcome are affected by each variable. The full summary of the fitted model with coefficients, errors and \(p\)-values is shown in table 3.1. The \(p\)-values are calculated based on the Wald test.

Based on the odds ratios, the most significant variables contributing to high alcohol use are goout followed by absences. For men, the OR is larger than for women, so men are much more likely to be heavy drinkers. This agrees with our previous observations for these parameters, and agrees with our hypothesis. The exam score G3 OR is less than 1, so a higher exam score decreases the probability of high alcohol use as per our previous observations. The \(p\)-values for the sex, goout and absences variables are all statistically significant (\(< 0.05\)), while the G3 \(p\)-values are not. This disagrees with our hypothesis that the final exam score predicts alcohol use.
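To illustrate the interpretation (the coefficient value for goout is taken from table 3.1): exponentiating a coefficient gives the factor by which a one-unit increase in that variable multiplies the odds of the outcome.

```r
# goout coefficient from the fitted model (table 3.1)
beta_goout <- 0.7055

# odds ratio: each extra point on goout roughly doubles the odds of high use
or_goout <- exp(beta_goout)
or_goout  # about 2.02
```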

# fit logistic regression
model <- glm(high_use ~ sex + goout + absences + G3 - 1, data = alc, 
             family = "binomial")
model_summary <- summary(model)

# exponentiate coefficients to get odds ratio (and confidence interval)
odds_ratio <- coef(model) %>% exp
conf_int <- confint(model) %>% exp

fe <- data.frame("Variable"=c("Sex (F)", "Sex (M)", 
                         "Go out", "Absences", 
                         "G3"),
                 "Coefficient"=c(model_summary$coefficients[1,1], 
                         model_summary$coefficients[2,1], 
                         model_summary$coefficients[3,1], 
                         model_summary$coefficients[4,1], 
                         model_summary$coefficients[5,1]),
                 "Std.error"=c(model_summary$coefficients[1,2], 
                         model_summary$coefficients[2,2], 
                         model_summary$coefficients[3,2], 
                         model_summary$coefficients[4,2], 
                         model_summary$coefficients[5,2]),
                 "P-value"=c(model_summary$coefficients[1,4], 
                         model_summary$coefficients[2,4], 
                         model_summary$coefficients[3,4], 
                         model_summary$coefficients[4,4], 
                         model_summary$coefficients[5,4]),
                 "OR"=c(odds_ratio[1], 
                         odds_ratio[2], 
                         odds_ratio[3], 
                         odds_ratio[4], 
                         odds_ratio[5]),
                 "CI2.5"=c(conf_int[1,1], 
                         conf_int[2,1], 
                         conf_int[3,1], 
                         conf_int[4,1], 
                         conf_int[5,1]),
                 "CI97.5"=c(conf_int[1,2], 
                         conf_int[2,2], 
                         conf_int[3,2], 
                         conf_int[4,2], 
                         conf_int[5,2]))
fe <- format_engr(fe)
kable(fe, caption = "Fit summary for model with 4 parameters.")
Table 3.1: Fit summary for model with 4 parameters.
Variable Coefficient Std.error P.value OR CI2.5 CI97.5
Sex (F) \(-3.580\) \({703.6}\times 10^{-3}\) \({360.2}\times 10^{-9}\) \({27.87}\times 10^{-3}\) \({6.655}\times 10^{-3}\) \({105.7}\times 10^{-3}\)
Sex (M) \(-2.562\) \({683.5}\times 10^{-3}\) \({178.2}\times 10^{-6}\) \({77.17}\times 10^{-3}\) \({19.40}\times 10^{-3}\) \({285.3}\times 10^{-3}\)
Go out \({705.5}\times 10^{-3}\) \({121.7}\times 10^{-3}\) \({6.681}\times 10^{-9}\) \(2.025\) \(1.605\) \(2.588\)
Absences \({82.63}\times 10^{-3}\) \({22.67}\times 10^{-3}\) \({267.9}\times 10^{-6}\) \(1.086\) \(1.040\) \(1.138\)
G3 \({-45.08}\times 10^{-3}\) \({39.85}\times 10^{-3}\) \({258.0}\times 10^{-3}\) \({955.9}\times 10^{-3}\) \({883.8}\times 10^{-3}\) \(1.034\)

We fit the model again, this time with only the variables sex, goout and absences. Now all of the variables in our model have a statistically significant relationship with alcohol use. Using this model, we predict the response for the training data and calculate the training loss as \(L_\text{train} = \text{E}[|\text{high_use}-p_\text{high_use}| > 0.5]\), where \(\text{high_use}\) is the training data and \(p_\text{high_use}\) is the model prediction (probability). The training loss represents the fraction of students that were misclassified. For our model with 3 parameters, we get \(L_\text{train} = 0.2108\), which means that the model classifies the alcohol use of students with 79% accuracy. The cross tabulation of predictions vs. actual values in table 3.2 shows that the model underestimates the number of heavy drinkers: it most commonly misclassifies students who are heavy drinkers as light drinkers. This is nonetheless significantly better than guessing (e.g. flipping a coin), which would give a probability of correct classification of \(p(\text{TT} \cup \text{FF}) = 0.5 \times 0.7 + 0.5 \times 0.3 = 0.5\). We also compute this random guess loss random_err in the code below, which confirms a 50% error for random guessing.

The values for high_use and predictions from the model are plotted in figure 3.3. Predictions with probabilities \(>0.5\) are classified as high_use, visualized using colors. We can see how the prediction errors for heavy drinkers are visible in this plot.

# fit better logistic regression model
model <- glm(high_use ~ sex + goout + absences - 1, data = alc, 
             family = "binomial")
model_summary <- summary(model)

# loss function
loss <- function(data, fit) mean(abs(data - fit) > 0.5)

# add predicted values for data set
alc_model <- data.frame(high_use=alc$high_use,
                        high_use_prob=predict(model, type = "response"))
alc_model <- mutate(alc_model, high_use_predict=high_use_prob>0.5)

# calculate prediction error for random guess
n_rand <- 100
random_err <- 0
for (i in 1:n_rand) {
  alc_random <- data.frame(high_use=alc$high_use,
                          high_use_prob=runif(nrow(alc), 0.0, 1.0))
  alc_random <- mutate(alc_random, high_use_predict=high_use_prob>0.5)
  
  random_err <- random_err + 
    loss(alc_random$high_use, alc_random$high_use_prob) / n_rand
}

# cross tabulation
cross_tab <- table(high_use = alc_model$high_use, 
      prediction = alc_model$high_use_predict) %>% prop.table %>% addmargins

# training error
train_err <- loss(alc_model$high_use, alc_model$high_use_prob)

message("Train err: ", train_err, ", random: ", random_err)

kable(data.frame(Prediction.FALSE=cross_tab[,1],
                 Prediction.TRUE=cross_tab[,2],
                 Sum=cross_tab[,3]), caption = "Cross tabulation.")
Table 3.2: Cross tabulation.
Prediction.FALSE Prediction.TRUE Sum
FALSE 0.6540541 0.0459459 0.7
TRUE 0.1648649 0.1351351 0.3
Sum 0.8189189 0.1810811 1.0
ggplot(alc_model, aes(x = high_use_prob, 
                     y = high_use, col = high_use_predict)) + 
  geom_point() + scale_color_brewer(palette="Set2")

Figure 3.3: Actual values for alcohol use vs. predicted.

3.1 Bonus

To get a better estimate of the error for our model than the training loss alone, we use 10-fold cross-validation (CV). The function cv.glm() from the boot library is used with the parameter K = 10. This divides the data set into 10 folds and fits the model 10 times, each time testing on the held-out fold. Since CV never tests the model on the same data it was fitted on, averaging over the folds gives a better estimate of the error.
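To illustrate the mechanism, the following is a minimal sketch of what cv.glm(..., K = 10) does internally, run on synthetic data (the names x, y and d are made up for the example):

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))  # synthetic binary response
d <- data.frame(x = x, y = y)

loss <- function(data, fit) mean(abs(data - fit) > 0.5)

K <- 10
folds <- sample(rep(1:K, length.out = n))  # assign each row to a random fold
cv <- mean(sapply(1:K, function(k) {
  # fit on all folds except k, evaluate the loss on the held-out fold k
  fit <- glm(y ~ x, data = d[folds != k, ], family = "binomial")
  p <- predict(fit, newdata = d[folds == k, ], type = "response")
  loss(d$y[folds == k], p)
}))
```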

The CV error for our 3-variable model is \(L_\text{CV} = 0.2142\), which is slightly larger than the training loss \(L_\text{train} = 0.2108\), as expected. The error for this model appears to be lower than that of the exercise set 3 model.

cv_err <- 0
n_cv <- 100
for (i in 1:n_cv) {
  alc_cv <- cv.glm(data = alc, 
                   cost = loss, 
                   glmfit = model, K = 10)
  cv_err <- cv_err + alc_cv$delta[1] / n_cv
}
message("CV: ", cv_err)
## CV: 0.214594594594595

3.2 Super-Bonus

Finally, we attempt to find a better model by running cross-validation for a large number of different variable combinations. The R code below starts from a list of some of the most highly correlated variables and fits models in a loop. After each fit, we remove the variable with the largest \(p\)-value from the set of fitted variables. This removes the worst variable from the model, so the number of parameters decreases, i.e. the model flexibility decreases. For each fit, we also record the training loss \(L_\text{train}\) and cross-validation error \(L_\text{CV}\). The refitting process continues until we run out of variables.

The best model found has a training loss of \(L_\text{train}=0.2162\) and CV error \(L_\text{CV}=0.2094\). It consists of the 5 variables sex, goout, absences, famrel and studytime, all of which are statistically significant.

The training and CV errors are plotted in figure 3.4. We can see that the training error decreases with model flexibility. The CV error decreases to a minimum at \(5\) parameters before starting to increase again. These results are a consequence of the bias-variance tradeoff: as model flexibility increases, bias decreases and variance increases. The training error approaches 0 as the bias decreases, but the test error is a combination of both factors. The CV error approximates the test error, which is why we see a minimum where the two factors are balanced.

flex_err <- data.frame(matrix(nrow = 0, ncol = 3))

# initial model variables to fit
model_variables <- c("age", "sex", "goout", "failures", "absences", 
                     "freetime", "famrel", "higher", "studytime", 
                     "traveltime", "address", "G1", "G2", "G3")
alc_i <- select(alc, all_of(c("high_use", model_variables))) # initial data set
n_var <- length(model_variables)
for (i in 1:n_var) { # loop through models
  # fit model with all columns in alc_i
  model_i <- glm(high_use ~ ., data = alc_i,
                 family = "binomial")
  model_i_summary <- summary(model_i)
  #print(model_i_summary)
  # model flexibility (number of variables in this fit)
  flex <- length(model_variables)
  
  # calculate training error
  prob_i <- predict(model_i, type = "response")
  train_err_i <- loss(alc_i$high_use, prob_i)
  #message("Train loss: ", train_err_i)
  
  # calculate CV error before removing any variables,
  # since cv.glm() refits the formula high_use ~ . on alc_i
  cv_err <- 0
  n_cv <- 10
  for (j in 1:n_cv) {
    alc_cv <- cv.glm(data = alc_i, 
                     cost = loss, 
                     glmfit = model_i, K = 10)
    cv_err <- cv_err + alc_cv$delta[1] / n_cv
  }
  #message("CV: ", cv_err)
  
  flex_err <- rbind(flex_err, data.frame(flex=flex, 
                                         train_err=train_err_i, cv_err=cv_err))
  
  # get index of variable with greatest p-value (not intercept)
  # and remove it from the data set for the next fit
  max_p <- which.max(model_i_summary$coefficients[2:(flex + 1), 4])
  rm_col <- model_variables[max_p]
  model_variables <- model_variables[-max_p]
  alc_i <- select(alc_i, -one_of(rm_col))
}
ggplot() + 
  geom_line(data = flex_err, aes(x = flex, y = train_err, col="Train")) +
  geom_line(data = flex_err, aes(x = flex, y = cv_err, col="CV")) +
  labs(x = "Flexibility", y="Error", color="Error")

Figure 3.4: Training and CV error vs. model flexibility.


4 Clustering and classification (week 4)

Clustering and classification are statistical methods that can be used to investigate the relationships between variables in a data set and to partition the data set into groups. Classification can be used when we have a training data set available which contains the class of each data point. Clustering can be used even when this information is not available, by partitioning the data set based on the distances between observations.
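The distinction can be sketched in a few lines of R, here on the familiar iris data set (not part of this analysis), where the class labels happen to be known:

```r
library(MASS)

# classification (supervised): the known class labels are used in fitting
lda_fit <- lda(Species ~ ., data = iris)

# clustering (unsupervised): only the measurements are used, no labels
km_fit <- kmeans(iris[, 1:4], centers = 3)
```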

We will investigate crime rates in suburbs of Boston using the Boston data set from the MASS library4. This data set has 506 rows and 14 variables, containing various statistics from suburbs of Boston, such as land zoning, location, housing values and population demographics. The variables are (roughly described): crim (crime rate per capita), zn (proportion of residential land zoned for large lots), indus (proportion of industrial land), chas (close to Charles River), nox (nitrogen oxides concentration), rm (average number of rooms per home), age (proportion of old homes), dis (distance to Boston employment centres), rad (accessibility of radial highways), tax (property tax rate), ptratio (pupil-teacher ratio), black (proportion of black people), lstat (proportion of lower status), medv (median value of homes). The exact definitions of these variables are given in the MASS library documentation. We are interested in how the other variables affect the crime rate in each suburb. Using classification, we want to predict the crime rate in each suburb from the other variables. Using clustering, we want to find out whether the data set can be partitioned in a meaningful way based on the variables.

library(MASS)

# load Boston data set from MASS library
data("Boston")

# explore structure of Boston data set
# 506 rows, 14 variables
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
boston_summary <- summary(Boston)

# create summary table for data set
summary_table <- function(data, cap) {
  data_summary <- as.data.frame(apply(data, 2, summary))
  kable(data_summary, caption = cap, digits=2) %>%
  kable_styling(font_size = 12)
}

# make summary of Boston data set
summary_table(Boston, "Unscaled Boston data set summary.")
Table 4.1: Unscaled Boston data set summary.
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
Min. 0.01 0.00 0.46 0.00 0.38 3.56 2.90 1.13 1.00 187.00 12.60 0.32 1.73 5.00
1st Qu. 0.08 0.00 5.19 0.00 0.45 5.89 45.02 2.10 4.00 279.00 17.40 375.38 6.95 17.02
Median 0.26 0.00 9.69 0.00 0.54 6.21 77.50 3.21 5.00 330.00 19.05 391.44 11.36 21.20
Mean 3.61 11.36 11.14 0.07 0.55 6.28 68.57 3.80 9.55 408.24 18.46 356.67 12.65 22.53
3rd Qu. 3.68 12.50 18.10 0.00 0.62 6.62 94.07 5.19 24.00 666.00 20.20 396.22 16.96 25.00
Max. 88.98 100.00 27.74 1.00 0.87 8.78 100.00 12.13 24.00 711.00 22.00 396.90 37.97 50.00

A summary of the variables in the Boston data set is given in table 4.1. Histograms for each of the variables are plotted in figure 4.1. We can see that the different variables have different means and standard deviations, depending on how those variables are defined. The median crime rate is 0.26, while the mean crime rate is 3.61, which suggests that most areas have low crime rates, while a few areas have very high crime rates. The residential and industrial land variables zn and indus have approximately the same mean of about 11%. It's notable that the median for zn is 0%, which could mean that there are many suburbs with only small zoned lots. The variable chas is a dummy variable, so it's 1 for suburbs bounding the river and 0 otherwise. The mean concentration of nitrogen oxides nox is 0.55 parts per 10 million, and the distribution is fairly even. The average number of rooms in homes is 6, and most of the owner-occupied homes have been built prior to 1940. The mean distance to employment centers dis and accessibility of radial highways rad are 3.8 and 9.55 respectively. Many suburbs are close to employment centers and have good accessibility to radial highways, while many others have poor accessibility. The property tax rate varies from low to high between suburbs, with a mean of 408. The mean pupil to teacher ratio is 18, with some variation between suburbs. Many suburbs have a sizable proportion of black people, while a few have only a small proportion (minimum 0.32, while the 1st quartile is 375.38). There are many suburbs with a smaller low-status population, and fewer suburbs with a larger one. The value of homes has a mean of about $23000, ranging from $5000 to $50000.

# plot histogram from vector
vhist <- function(var, xlab) ggplot() + aes(var) + 
  geom_histogram() + xlab(xlab) + ylab("Count")

# plot histograms for all variables in boston data set
boston_hist <- function(data) {
  h1 <- vhist(data$crim, "Crime per capita")
  h2 <- vhist(data$zn, "Proportion of residential land")
  h3 <- vhist(data$indus, "Proportion of industrial land")
  h4 <- vhist(data$chas, "Bounds river")
  h5 <- vhist(data$nox, "Nitrogen oxides concentration")
  h6 <- vhist(data$rm, "Rooms per dwelling")
  h7 <- vhist(data$age, "Proportion of old homes")
  
  h8 <- vhist(data$dis, "Distance to employment centres")
  h9 <- vhist(data$rad, "Accessibility to radial highways")
  h10 <- vhist(data$tax, "Property tax rate")
  h11 <- vhist(data$ptratio, "Pupil to teacher ratio")
  h12 <- vhist(data$black, "Proportion of black people")
  h13 <- vhist(data$lstat, "Lower status of population")
  h14 <- vhist(data$medv, "Value of homes")
  
  grid.arrange(h1, h2, h3, h4, h5, h6, h7, 
               h8, h9, h10, h11, h12, h13, h14, 
               ncol=7, nrow=2)
}

# plot histogram of variables
boston_hist(Boston)

Figure 4.1: Histograms of variables in unscaled Boston data set.

We can examine how the variables in the Boston data set are related to each other by plotting the correlation matrix. In figure 4.2, the correlation coefficients between each of the variables are listed (lower half) and visualized using circles (upper half). The statistical significance level is shown inside each circle using stars (*** 0.001, ** 0.01, * 0.05). According to the figure, many of the variables in the Boston data set are highly correlated with each other. The most significant variables correlated with the crime rate are rad, tax and lstat. This means that accessibility to radial highways and the tax rate contribute to the crime rate, while the low-status population also plays a significant role. This could mean that the crime rate is higher in central parts of the city, which also have a higher low-status population.

# calculate correlation coefficient matrix for boston data set
boston_cor <- cor(Boston) %>% round(digits = 2)

# calculate 95% confidence intervals and p-values
boston_conf <- cor.mtest(Boston, conf.level = .95)
boston_p <- boston_conf$p
# drop lower p-values
boston_p[lower.tri(boston_p, diag = T)] <- 1

# create correlation matrix with coefficients and p-values
corrplot.mixed(boston_cor, 
         p.mat = boston_p,
         upper = "circle",
         lower = "number",
         insig = "label_sig",
         pch.col = "black",
         pch.cex = 1.0,
         lower.col = "black",
         sig.level = c(.001, .01, .05))

Figure 4.2: Correlation matrix of Boston data set with correlation coefficients and \(p\)-values.

We standardize the Boston data set by shifting the mean of each variable \(\mu\) to 0 and scaling the standard deviation \(\sigma\) to 1. Each variable \(x\) is standardized by \(x' = \frac{x - \mu}{\sigma}\) using the function scale(). The summary of the scaled Boston data set is shown in table 4.2 and histograms of the variables are shown in figure 4.3. We can see that the means of each variable have been shifted to 0. Similarly, the distributions of each variable have been scaled to a more similar range. The distributions of the different variables have different shapes (which are unchanged), so the ranges are not exactly the same, but they have the same standard deviation.
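As a quick sanity check on a small made-up vector, scale() gives exactly \(x' = (x - \mu)/\sigma\):

```r
x <- c(1, 4, 7, 10)
manual <- (x - mean(x)) / sd(x)  # standardize by hand

# scale() gives the same result (as a one-column matrix)
stopifnot(isTRUE(all.equal(as.vector(scale(x)), manual)))

# the standardized variable has mean 0 and standard deviation 1
stopifnot(abs(mean(manual)) < 1e-12, abs(sd(manual) - 1) < 1e-12)
```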

We create a categorical (factor) variable crime from crim by splitting the crime rate distribution at its quartiles. This results in 4 classes of crime rate: low, med_low, med_high and high. The old crime rate variable crim is removed from the data set.

We split the scaled Boston data set into two data sets: train and test. The training data set is created by randomly sampling 80% of the rows of the original scaled Boston data set; the remaining 20% forms the test data set. The target variable crime is removed from the test data set (but saved for checking the predictions).

# normalize boston data set by scaling mean and standard deviation
boston_scaled <- as.data.frame(scale(Boston))

# check normalized data set
boston_scaled_summary <- summary(boston_scaled)

# make summary of scaled Boston data set
summary_table(boston_scaled, "Scaled Boston data set summary.")
Table 4.2: Scaled Boston data set summary.
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
Min. -0.42 -0.49 -1.56 -0.27 -1.46 -3.88 -2.33 -1.27 -0.98 -1.31 -2.70 -3.90 -1.53 -1.91
1st Qu. -0.41 -0.49 -0.87 -0.27 -0.91 -0.57 -0.84 -0.80 -0.64 -0.77 -0.49 0.20 -0.80 -0.60
Median -0.39 -0.49 -0.21 -0.27 -0.14 -0.11 0.32 -0.28 -0.52 -0.46 0.27 0.38 -0.18 -0.14
Mean 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3rd Qu. 0.01 0.05 1.01 -0.27 0.60 0.48 0.91 0.66 1.66 1.53 0.81 0.43 0.60 0.27
Max. 9.92 3.80 2.42 3.66 2.73 3.55 1.12 3.96 1.66 1.80 1.64 0.44 3.55 2.99
set.seed(6344) # get rid of randomness

boston_hist(boston_scaled)

Figure 4.3: Histograms of variables in scaled Boston data set.

# categorize crime into 4 categories low, med_low, med_high and high by quantile
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), 
              include.lowest = TRUE, 
             labels = c("low", "med_low", "med_high", "high"))
# remove old crim variable from scaled data set
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add new categorical variable
boston_scaled <- data.frame(boston_scaled, crime)

# randomly select rows from data in ratio 4/5
boston_n <- nrow(boston_scaled)
boston_ind <- sample(boston_n,  size = boston_n * 0.8)

# create training data set by selecting 4/5 from original data set
boston_train <- boston_scaled[boston_ind,]

# create test data set by selecting 1/5 from original data set
boston_test <- boston_scaled[-boston_ind,]
# save crime target variable from test data set
boston_test_crime <- boston_test$crime
# remove the target variable from test
boston_test <- dplyr::select(boston_test, -crime)

We fit a linear discriminant analysis (LDA) model on the training data set using the function lda(). A function plot_lda() is written using ggplot2 to visualize the LDA model. This LDA biplot is shown in figure 4.4. The different crime classes are visualized using colors, along with normal data ellipses. The variables are visualized using arrows, with the length (and opacity) of each arrow calculated as \(l = \sqrt{\text{LD1}^2+\text{LD2}^2}\). We can see that the variable rad is the most significant separator between the classes, followed by zn and nox. It looks like the suburbs are clearly separated into two groups: those with high crime and those with less crime.

# draw lda biplot
plot_lda <- function(fit, data) {
  # get LD1, LD2 from fit
  lda_predict = predict(fit)$x
  predict_data = data.frame(
    LD1 = lda_predict[,1],
    LD2 = lda_predict[,2],
    Crime = data$crime)
  
  # calculate data for biplot arrows (length, angle)
  lda_data <- data.frame(var=rownames(coef(fit)), coef(fit))
  lda_data$length <- with(lda_data, sqrt(LD1^2 + LD2^2))
  lda_data$angle <- atan2(lda_data$LD2, lda_data$LD1)
  # text positions
  lda_data$x_end <- cos(lda_data$angle) * lda_data$length
  lda_data$y_end <- sin(lda_data$angle) * lda_data$length
  
  # plot points, normal data ellipses and arrows
  ggplot(data=predict_data, aes(x=LD1,y=LD2,col=Crime)) + geom_point() + 
    stat_ellipse(aes(fill = Crime), geom = "polygon", alpha = .3) +
    geom_spoke(aes(0, 0, angle = angle, alpha = length, radius = length), 
               lda_data, color = "red", size = 0.5, 
               arrow = arrow(length = unit(0.2, "cm"))) +
    geom_text(aes(y = y_end, x = x_end, label = var, alpha = length),
            lda_data, size = 4, vjust = .5, hjust = 0, color = "red")
}

# fit linear discriminant analysis on train data set
boston_lda <- lda(crime ~ ., data = boston_train)

# calculate LDA length
lda_data2 <- data.frame(var=rownames(coef(boston_lda)), coef(boston_lda))
lda_data2$length <- with(lda_data2, sqrt(LD1^2 + LD2^2))

# make LDA biplot
plot_lda(boston_lda, boston_train)

Figure 4.4: LDA biplot for Boston training set.

We predict the crime classes using the LDA model with the test data set and compare the predictions to the correct crime classes. These predictions are cross-tabulated in table 4.3. The LDA model prediction accuracy is 57% for the low class, 80% for med_low, 53% for med_high and 96% for high. The model misclassifies some low points as med_low and some med_high points as med_low. The total error rate of the model is 30%.
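These percentages can be verified from the counts in table 4.3: per-class accuracy is the diagonal count divided by the row sum, and the total error rate is the off-diagonal fraction. A small check with the counts transcribed from the table:

```r
# counts transcribed from table 4.3 (rows: correct class, columns: prediction)
ct <- rbind(low      = c(13,  9,  1,  0),
            med_low  = c( 2, 12,  1,  0),
            med_high = c( 0, 17, 19,  0),
            high     = c( 0,  0,  1, 27))

class_acc <- diag(ct) / rowSums(ct)       # per-class accuracy
total_err <- 1 - sum(diag(ct)) / sum(ct)  # total error rate
round(class_acc, 2)  # 0.57 0.80 0.53 0.96
round(total_err, 2)  # 0.3
```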

# predict values from LDA using test data set
boston_predict_test <- predict(boston_lda, newdata = boston_test)

# cross-tabulation for predicted crime class
cross_tab <- table(correct = boston_test_crime, 
                    predict = boston_predict_test$class)
rownames(cross_tab) <- paste(rownames(cross_tab), ".correct", sep="")
colnames(cross_tab) <- paste(colnames(cross_tab), ".predict", sep="")

kable(cross_tab, caption = "Cross-tabulation for predicted crime class from LDA.") %>% 
 kable_styling()
Table 4.3: Cross-tabulation for predicted crime class from LDA.
low.predict med_low.predict med_high.predict high.predict
low.correct 13 9 1 0
med_low.correct 2 12 1 0
med_high.correct 0 17 19 0
high.correct 0 0 1 27

Distances between observations in the scaled Boston data set are calculated using the dist() function with the methods euclidean and manhattan. These methods calculate the distance \(d\) between observations as \(d^2 = \sum_i \Delta x_i^2\) and \(d = \sum_i |\Delta x_i|\) respectively (summing over dimensions \(i\)). The Manhattan distance is the \(L^1\) norm and the Euclidean distance is the \(L^2\) norm of the difference vector. Summaries of these distances are shown in table 4.4. We can see that all the values for the Manhattan distance are larger than for the Euclidean distance, which is what we would expect: by the triangle inequality, the straight-line (Euclidean) distance can never exceed the length of a path along the coordinate axes (Manhattan).
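A small worked example with a made-up pair of points confirms both formulas and the ordering:

```r
a <- c(0, 0)
b <- c(3, 4)

d_euc <- sqrt(sum((a - b)^2))  # Euclidean (L2): sqrt(9 + 16) = 5
d_man <- sum(abs(a - b))       # Manhattan (L1): 3 + 4 = 7

# dist() on a two-row matrix reproduces both values
m <- rbind(a, b)
stopifnot(isTRUE(all.equal(as.numeric(dist(m)), d_euc)))
stopifnot(isTRUE(all.equal(as.numeric(dist(m, method = "manhattan")), d_man)))
stopifnot(d_man >= d_euc)  # the L1 distance is never smaller than the L2
```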

# reload scaled boston data set
boston_scaled <- as.data.frame(scale(Boston))

# calculate Euclidean distance matrix sqrt(dx^2 + dy^2)
boston_dist <- dist(boston_scaled)
# calculate Manhattan distance matrix |dx| + |dy|
boston_dist_man <- dist(boston_scaled, method = "manhattan")

# plot distances as graphs
# library(qgraph)
# par(mfrow=c(1, 2))
# qgraph(1 / as.matrix(boston_dist), layout='spring', vsize=3)
# title("Euclidean",line=2.5)
# qgraph(1 / as.matrix(boston_dist_man), layout='spring', vsize=3)
# title("Manhattan",line=2.5)
dist_summary <- apply(data.frame(boston_dist), 2, summary)
dist_man_summary <- apply(data.frame(boston_dist_man), 2, summary)
dist_table <- data.frame(Euclidean=unname(dist_summary), 
                 Manhattan=unname(dist_man_summary))
rownames(dist_table) <- row.names(dist_summary)
kable(dist_table, 
      caption = "Summary of Euclidean and Manhattan distances.", digits=2) %>% 
  kable_styling()
Table 4.4: Summary of Euclidean and Manhattan distances.
Euclidean Manhattan
Min. 0.13 0.27
1st Qu. 3.46 8.48
Median 4.82 12.61
Mean 4.91 13.55
3rd Qu. 6.19 17.76
Max. 14.40 48.86

We run the \(k\)-means algorithm on the scaled Boston data set, first with \(k=3\). To find the optimal value for \(k\), we calculate the total within-cluster sum of squares (TWCSS) for a range \(k \in [1, 10]\). These are averaged over 10 runs to reduce random variation. The resulting graph is plotted in figure 4.5. We can see that the largest reduction in TWCSS occurs at \(k=2\), while for larger \(k\) the differences are less significant. According to the elbow method, the optimal value would likely be \(k=2\).
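Note that kmeans() reports the TWCSS directly as tot.withinss, which is simply the sum of the per-cluster withinss values (the squared distances to each cluster centroid). A quick check, run on the scaled Boston data with an arbitrary seed:

```r
library(MASS)
data("Boston")

set.seed(1)
km <- kmeans(scale(Boston), centers = 2)

# tot.withinss is the sum of the per-cluster within-cluster sums of squares
stopifnot(isTRUE(all.equal(km$tot.withinss, sum(km$withinss))))
```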

boston_km <- kmeans(boston_scaled, centers = 3)

k_max <- 10 # maximum k
km_n <- 10 # number of k-means twcss to average
twcss <- rep(0, k_max)
for (i in 1:km_n) {
  twcss <- twcss + sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss) / km_n
}

ggplot() + aes(x = 1:k_max, y = twcss) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=1:k_max) +
  xlab("k") + ylab("TWCSS")

Figure 4.5: Total within-cluster sum of squares for different \(k\).

k_best <- which.max(twcss[1:(k_max-1)]-twcss[2:k_max])+1

boston_km <- kmeans(boston_scaled, centers = k_best)

The \(k\)-means clusters are plotted using ggpairs() in figure 4.6. The scatter plots and histograms are colored based on which of the two clusters the data points belong to. We can see that the scatter plots show quite good separation between the groups for most combinations of variables, and the blue and red areas mostly don't overlap. Similarly, we see that the distributions of the variables are mostly different, though there are exceptions (the variables chas and rm, for example). It's also notable that the correlation coefficients between variables differ between the two clusters. These observations suggest that the two clusters represent actual separate groups in the data. Based on figure 4.7, we can determine that the crime rate is effectively separated by cluster: all the crime rates in the lower cluster are below the mean. This suggests that the data has been separated into low-crime and high-crime clusters.

ggpairs(boston_scaled, mapping = aes(color=factor(boston_km$cluster), alpha=0.7))

Figure 4.6: Pair plot of scaled Boston data set colored by \(k\)-means cluster.

crime_cluster <- data.frame(crim=boston_scaled$crim, cluster=factor(boston_km$cluster))
ggplot(crime_cluster, aes(x = crim, col = cluster)) + geom_boxplot()

Figure 4.7: Box plot of crime rate by cluster.

4.1 Bonus

We run \(k\)-means clustering for \(k \in [3, 5]\) on the scaled Boston data set. LDA is then run on the data set with the \(k\)-means cluster as the target variable. The LDA biplots are shown in figure 4.8, colored by cluster. We get perhaps the best separation using \(k=4\). Depending on the number of clusters, different variables are influential. The lengths of the variable vectors are given in table 4.5. For \(k=3\), the most significant variables are rad, tax and age. For \(k=4\), they are rad, zn and tax. For \(k=5\), they are black, nox and tax. Many of these variables are closely related to location, which seems to be important in determining the crime rate.

set.seed(6344) # get rid of randomness
# reload scaled boston data set
boston_scaled <- as.data.frame(scale(Boston))

# draw lda biplot
plot_lda <- function(fit, data) {
  # get LD1, LD2 from fit
  lda_predict = predict(fit)$x
  predict_data = data.frame(
    LD1 = lda_predict[,1],
    LD2 = lda_predict[,2],
    Cluster = factor(data$cluster))
  
  # calculate data for biplot arrows (length, angle)
  lda_data <- data.frame(var=rownames(coef(fit)), coef(fit))
  lda_data$length <- with(lda_data, sqrt(LD1^2 + LD2^2))
  lda_data$angle <- atan2(lda_data$LD2, lda_data$LD1)
  # text positions
  lda_data$x_end <- cos(lda_data$angle) * lda_data$length
  lda_data$y_end <- sin(lda_data$angle) * lda_data$length
  
  # plot points, normal data ellipses and arrows
  ggplot(data=predict_data, aes(x=LD1,y=LD2,col=Cluster)) + geom_point() + 
    stat_ellipse(aes(fill = Cluster), geom = "polygon", alpha = .3) +
    geom_spoke(aes(0, 0, angle = angle, alpha = length, radius = length), 
               lda_data, color = "red", size = 0.5, 
               arrow = arrow(length = unit(0.2, "cm"))) +
    geom_text(aes(y = y_end, x = x_end, label = var, alpha = length),
            lda_data, size = 4, vjust = .5, hjust = 0, color = "red")
}

boston_km3 <- kmeans(boston_scaled, centers = 3)
# fit linear discriminant analysis with k-mean cluster as target
boston_lda3 <- lda(boston_km3$cluster ~ ., data = boston_scaled)
lda3 <- plot_lda(boston_lda3, boston_km3)
lda_data3 <- data.frame(var=rownames(coef(boston_lda3)), coef(boston_lda3))
lda_data3$length <- with(lda_data3, sqrt(LD1^2 + LD2^2))

boston_km4 <- kmeans(boston_scaled, centers = 4)
# fit linear discriminant analysis with k-mean cluster as target
boston_lda4 <- lda(boston_km4$cluster ~ ., data = boston_scaled)
lda4 <- plot_lda(boston_lda4, boston_km4)
lda_data4 <- data.frame(var=rownames(coef(boston_lda4)), coef(boston_lda4))
lda_data4$length <- with(lda_data4, sqrt(LD1^2 + LD2^2))

boston_km5 <- kmeans(boston_scaled, centers = 5)
# fit linear discriminant analysis with k-mean cluster as target
boston_lda5 <- lda(boston_km5$cluster ~ ., data = boston_scaled)
lda5 <- plot_lda(boston_lda5, boston_km5)
lda_data5 <- data.frame(var=rownames(coef(boston_lda5)), coef(boston_lda5))
lda_data5$length <- with(lda_data5, sqrt(LD1^2 + LD2^2))

grid.arrange(lda3, lda4, lda5, ncol=3, nrow=1)

Figure 4.8: LDA biplots for \(k\)-means clusters.

llen_table <- data.frame(LDA3Length=lda_data3$length, 
                         LDA4Length=lda_data4$length,
                         LDA5Length=lda_data5$length)
rownames(llen_table) <- lda_data3$var

kable(llen_table, 
      caption = "LDA lengths for different variables and different $k$.", digits=2) %>% 
  kable_styling()
Table 4.5: LDA lengths for different variables and different \(k\).
LDA3Length LDA4Length LDA5Length
crim 0.21 0.19 0.71
zn 0.36 1.51 0.17
indus 0.34 0.42 0.73
chas 0.14 0.12 0.02
nox 0.20 0.04 0.94
rm 0.46 0.18 0.48
age 0.86 0.60 0.22
dis 0.63 0.54 0.10
rad 2.05 6.14 0.53
tax 1.23 0.64 0.89
ptratio 0.13 0.31 0.34
black 0.17 0.04 2.35
lstat 0.20 0.14 0.28
medv 0.37 0.15 0.15

4.2 Super-Bonus

We plot the LDA result and the \(k\)-means clusters for the Boston training data set in 3D using the function plot_ly(). The data is projected onto the linear discriminants by taking the matrix product of the predictors and the LDA scaling matrix. In the LDA plot the color is the crime class, while the \(k\)-means plot is colored by cluster. We choose \(k=4\) for plotting.

We can see that the LDA plot shows a similar shape as before, except now the third dimension LD3 is also shown. We see good separation between the classes, especially between the high class and the others. The \(k\)-means plot shows a fairly similar result: there is a clear difference between one cluster and the rest, though in LDA this separation is perhaps cleaner. The remaining clusters are also separated from each other, but with more overlap. In the LDA plot the classes overlap more than the \(k\)-means clusters do, which makes sense, as the clustering is based directly on distance. The \(k\)-means clustering appears to approximate the crime classification.

set.seed(1919) # get rid of randomness
model_predictors <- dplyr::select(boston_train, -crime)
boston_train_km4 <- kmeans(model_predictors, centers = 4)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(boston_lda$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% boston_lda$scaling
matrix_product <- as.data.frame(matrix_product)

lda1 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, 
        color = boston_train$crime, type = 'scatter3d', mode='markers', 
        marker=list(size = 3)) %>% 
  layout(title = "Crime classes", scene = list(xaxis = list(title = 'LD1'),
                     yaxis = list(title = 'LD2'),
                     zaxis = list(title = 'LD3')))

lda2 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        color = factor(boston_train_km4$cluster), type = 'scatter3d', mode='markers',
        marker=list(size = 3)) %>%
  layout(title = "K-means clusters", scene = list(xaxis = list(title = 'LD1'),
                     yaxis = list(title = 'LD2'),
                     zaxis = list(title = 'LD3')))
Harrison, Ewen, and Riinu Pius. 2021. R for Health Data Science. https://argoshare.is.ed.ac.uk/healthyr_book/.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: with Applications in R. Springer.
Vehkalahti, Kimmo. 2020. "IODS-project: Template for the IODS Course." https://github.com/KimmoVehkalahti/IODS-project/.
Vehkalahti, Kimmo, and Brian Everitt. 2019. Multivariate Analysis for the Behavioral Sciences. CRC Press.

  1. http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt↩︎

  2. http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt↩︎

  3. http://www.archive.ics.uci.edu/dataset/320/student+performance↩︎

  4. https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html↩︎